library(tidyverse, quietly = TRUE)
library(plotly) # 3D plotting
#set seed for reproducibility
set.seed(2020)
We like to throw around the term “dimensionality reduction” a lot, but what is that? In some ways, it is just what it says, reducing the number of dimensions in our data set, which is probably not very helpful. What do we mean by a dimension?
A dimension, in this context, is a measurement that we have for all of our samples. In the RNA-seq context, one dimension is the expression of a single gene. We have tens of thousands of genes that we have measured, so we have tens of thousands of dimensions.
Perhaps unsurprisingly, this can be very hard to visualize, as our brains have trouble thinking in more than 3 dimensions, and paper or computer screens effectively only show two at a time (maybe 3 with color and color vision). We have some techniques that are useful, such as heatmaps of the gene expression matrix, but these can be hard to interpret, and it is extremely hard to accurately show relationships among samples in a grid (hierarchical trees are useful, but also confusing and easily misinterpreted).
So what we would like to do is take this information with all of its dimensions and reduce it down to something we can visualize, while maintaing as much of the underlying information as possible.
To illustrate this, we will use a fun little data set called Palmer Penguins. This is contained in an R package may not be installed on your machine, so you can install it for your user with:
install.packages("palmerpenguins")
We can then load the dataframe into our environment and look at its structure with:
penguins <- palmerpenguins::penguins
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
What we have is a data frame (tibble, really) with 8 columns, so 8 dimensions!
There is some missing data in there, so let’s clean that up with tidyr::drop_na(), which will remove any rows with missing values:
penguins <- drop_na(penguins)
We will start by focusing on the numeric dimensions: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. That is still 4 dimensions, which is going to be hard to visualize, but we can try.
We’ll start with an interactive 3d plot, using a package called plotly. The main function plot_ly() works much like ggplot, but with a somewhat different format for its arguments, and using pipes (%>%) instead of plusses (+)
plot_ly(penguins,
x = ~ bill_length_mm,
y = ~ bill_depth_mm,
z = ~ flipper_length_mm,
color = ~ body_mass_g) %>%
add_markers(size = 0.5)
You can drag around the 3d plot there to get different views, but it can be hard to find one that shows all of the structure in the data from a single view. Nonetheless, it should be clear that there are some views that are better than others, and if we display the data from those perspectives, it will show more of the variation in the data set than others where all the points are stacked on top of one another
This is where dimensionality reduction will help out!
The simplest from of dimensionality reduction is to just throw out a dimension. This is what we did when we selected only the most variable genes. As you might guess, this can be fraught, but it is also what happens any time we make a plot of just two dimensions of our data.
Below is a plot of just the bill length and bill depth. There is nothing wrong with this plot, but it is certainly not a view of the full data set, and it probably doesn’t show the same clarity of clustering that you might have been able to infer from playing with the 3D plot.
ggplot(penguins, aes(x = bill_length_mm,
y = bill_depth_mm,
color = species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
Principal components analysis (PCA) applies linear algebra to find the axis of greatest variation in the data matrix, then the axis perpendicular to that with the most remaining variation, then the one perpendicular to those with the most remaining, etc. (If you remember your linear algebra, these are the eigenvectors of the matrix.) We scale these axes to have a mean of zero and variance of 1 for display, but it is important to know that they do not account equally for the variation in the original data set.
It is also important to note that mathematically we will always end up with the same number of PCs as dimensions we started with; our dimensionality “reduction” comes from not looking at the PCs that do not account for much variation. (Some implementations will throw out lower PCs by default, including the one in scater::calculatePCA())
Let’s illustrate this process using the penguin data, using bill depth and bill length, so we can compare the result to our 2D plot above. We will use the base R function prcomp() to apply principal component analysis. Since the scale for bill length and depth are quite different, we will use the argument scale to first transform the data so that each dimension has the same variance before PCA. This prevents the larger variable from dominating the results.
penguin_bills <- penguins[,3:4]
penguin_bill_pca <- prcomp(penguin_bills, scale = TRUE)
This returns a list, with the PCA transformed data is in the $x position, so lets plot the first two PCs from that matrix:
ggplot(as.data.frame(penguin_bill_pca$x),
aes(x = PC1, y = PC2,
color = penguins$species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
If you compare this to the plot above, you should see that it is just a rotation and scaling of the original data. If we looked at just PC1, we would retain most of the variation in the data.
Now let’s move up to three dimensions, adding in flipper length!
penguin_lengths <- penguins[,3:5]
penguin_length_pca <- prcomp(penguin_lengths, scale = TRUE)
ggplot(as.data.frame(penguin_length_pca$x),
aes(x = PC1, y = PC2,
color = penguins$species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
It turns out that this looks quite similar overall, though some points may have moved. This is because it turns out that there are some reasonable correlations between dimensions in the data, and in particular between bill width and flipper length:
cor(penguins$bill_length_mm, penguins$flipper_length_mm)
[1] 0.6530956
PCA is in effect rotating the data to remove correlations between the new axes, and correlated data end up collapsed into a single dimension.
If you play around with the 3D plot, you should be able to find a rotation that gives a similar view to the plot above.
Note that this plot does not necessarily give the best separation of species! The PCA we performed was blind to that information, and only finds a projection based on variation, not species id.
We can still plot the remaining dimension, just to see if it retains information we care about:
ggplot(as.data.frame(penguin_length_pca$x),
aes(x = PC3,
fill = penguins$species)) +
geom_histogram() +
colorblindr::scale_fill_OkabeIto()
There may still be some information there, especially distinguishing Adelie and Chinstrap penguins, but it is likely less than we get from the first two dimensions, so we may be comfortable discarding it.
UMAP and t-SNE allow non-linear transformations of the dimensions, but perform the conceptually similar task of creating a transformation that captures most of the variation in the multidimensional data in two axes.
Let’s see what the UMAP projection looks like:
penguin_length_umap <- uwot::umap(penguin_lengths)
ggplot(as.data.frame(penguin_length_umap),
aes(x = V1, y = V2,
color = penguins$species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
You can see that this has separated the data into much more discrete clusters (that do not always align with species), but that the data no longer look much at all like the original data.
In this data, the results are also unstable: repeating the above code gives a different looking result!
Let’s see what the UMAP projection looks like:
penguin_length_umap <- uwot::umap(penguin_lengths)
ggplot(as.data.frame(penguin_length_umap),
aes(x = V1, y = V2,
color = penguins$species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
What about t-SNE? Let’s have a look:
penguin_length_tsne <- Rtsne::Rtsne(penguin_lengths)
ggplot(as.data.frame(penguin_length_tsne$Y),
aes(x = V1, y = V2,
color = penguins$species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
The extension of these techniques to more dimensions is mathematically relatively straightforward, though it gets harder for our brains to envision. We can easily add in the mass of the penguins for a fourth dimension:
penguin_all_pca <- prcomp(penguins[,3:6], scale = TRUE)
ggplot(as.data.frame(penguin_all_pca$x),
aes(x = PC1, y = PC2,
color = penguins$species)) +
geom_point() +
colorblindr::scale_color_OkabeIto()
Extending this idea to thousands of dimensions may make your head hurt, but hopefully you can start to get the idea! And it is certainly easier than trying to actually plot thousands of dimensions at the same time!
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.9.2.1 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0
[5] purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3
[9] ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lubridate_1.7.9 lattice_0.20-41
[4] FNN_1.1.3 palmerpenguins_0.1.0 assertthat_0.2.1
[7] digest_0.6.25 RSpectra_0.16-0 R6_2.4.1
[10] cellranger_1.1.0 backports_1.1.8 reprex_0.3.0
[13] evaluate_0.14 httr_1.4.2 pillar_1.4.6
[16] rlang_0.4.7 lazyeval_0.2.2 readxl_1.3.1
[19] rstudioapi_0.11 data.table_1.13.0 blob_1.2.1
[22] Matrix_1.2-18 rmarkdown_2.3 labeling_0.3
[25] Rtsne_0.15 htmlwidgets_1.5.1 munsell_0.5.0
[28] uwot_0.1.8 broom_0.7.0 compiler_4.0.2
[31] modelr_0.1.8 xfun_0.16 base64enc_0.1-3
[34] pkgconfig_2.0.3 htmltools_0.5.0 tidyselect_1.1.0
[37] fansi_0.4.1 viridisLite_0.3.0 crayon_1.3.4
[40] colorblindr_0.1.0 dbplyr_1.4.4 withr_2.2.0
[43] grid_4.0.2 jsonlite_1.7.0 gtable_0.3.0
[46] lifecycle_0.2.0 DBI_1.1.0 magrittr_1.5
[49] scales_1.1.1 cli_2.0.2 stringi_1.4.6
[52] farver_2.0.3 fs_1.4.2 xml2_1.3.2
[55] ellipsis_0.3.1 generics_0.0.2 vctrs_0.3.2
[58] tools_4.0.2 glue_1.4.1 hms_0.5.3
[61] crosstalk_1.1.0.1 yaml_2.2.1 colorspace_1.4-1
[64] rvest_0.3.6 knitr_1.29 haven_2.3.1